Reference: Edward Tufte (2006) - Beautiful Evidence https://www.edwardtufte.com
Why do we use graphs in data analysis?
Characteristics of exploratory graphs
Air pollution in the United States
Annual average pm2.5 averaged over the period 2008 through 2010
if(!file.exists('./data')){dir.create('./data')}
fileUrl <- 'https://raw.githubusercontent.com/jtleek/modules/master/04_ExploratoryAnalysis/exploratoryGraphs/data/avgpm25.csv'
download.file(fileUrl,
destfile = './data/avgpm25.csv',
method = 'curl')
pollution <- read.csv('data/avgpm25.csv',
colClasses = c('numeric', 'character', 'factor', 'numeric', 'numeric'))
head(pollution)
## pm25 fips region longitude latitude
## 1 9.771185 01003 east -87.74826 30.59278
## 2 9.993817 01027 east -85.84286 33.26581
## 3 10.688618 01033 east -87.72596 34.73148
## 4 11.337424 01049 east -85.79892 34.45913
## 5 12.119764 01055 east -86.03212 34.01860
## 6 10.827805 01069 east -85.35039 31.18973
The features:
- pm25 - the level of PM2.5, the annual mean averaged over the 3 years 2008-2010
- fips - identifier of the county
- region - region of the country the county lies in (east or west)
- longitude - the longitude of the monitor in that county
- latitude - the latitude of the monitor in that county

The underlying question: Do any counties exceed the standard of 12 \(\mu\)g/m\(^3\)?
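The underlying question comes down to comparing each value against the threshold. A minimal sketch, assuming a small made-up vector of PM2.5 values (hypothetical, not rows from avgpm25.csv); with the real data the same expressions would presumably be applied to `pollution$pm25`.

```r
# Hypothetical annual mean PM2.5 values (micrograms per cubic metre),
# invented for illustration only
pm25 <- c(9.8, 10.0, 12.4, 11.3, 13.1, 8.5)
standard <- 12               # the standard named in the text

over <- pm25 > standard      # TRUE where a value exceeds the standard
n_over <- sum(over)          # how many values exceed it
share_over <- mean(over)     # proportion of values that exceed it
which(over)                  # positions of the exceeding values
```

Summing or averaging a logical vector works because R treats TRUE as 1 and FALSE as 0.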
- the _min_, _1st quartile_, _median_, _mean_, _3rd quartile_, _max_ of any of the dataframe variables
- `summary(df$var)`
- for the air pollution example:
```r
summary(pollution$pm25)
```
```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.383 8.549 10.047 9.836 11.356 18.441
```
```r
summary(pollution$longitude)
```
```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -158.04 -97.38 -87.37 -91.65 -80.72 -68.26
```
```r
summary(pollution$latitude)
```
```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.68 35.30 39.09 38.56 41.75 64.82
```
- Graphical representation of the five number summary (the six values above without the mean)
- `boxplot(df$var)`
- for the air pollution example:
boxplot(pollution$pm25, col = "wheat")
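The numbers behind a boxplot can be inspected without drawing anything. A small sketch, assuming a made-up vector with one extreme value (hypothetical data); `fivenum()` and `boxplot.stats()` are base R functions, and the default whisker rule (1.5 times the box height) is used.

```r
# Hypothetical values with one extreme observation
x <- c(1, 2, 3, 4, 100)

five <- fivenum(x)       # min, lower hinge, median, upper hinge, max
b <- boxplot.stats(x)    # the numbers boxplot() would draw

b$stats                  # whisker ends, hinges and median
b$out                    # points drawn individually beyond the whiskers
```

Here the upper whisker stops at the largest value that is not flagged as an outlier, so `b$stats` and `five` differ in their last element.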
- `hist(df$var)` and `rug(df$var)`
- for the air pollution example:
hist(pollution$pm25, col = "wheat")
rug(pollution$pm25)
- changing the histogram breaks
hist(pollution$pm25, col = "wheat", breaks = 100)
rug(pollution$pm25)
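The effect of `breaks` can be checked numerically, since `hist()` returns the bin counts when called with `plot = FALSE`. A sketch under stated assumptions: the data are simulated (a stand-in for `pollution$pm25`) and the seed is arbitrary.

```r
set.seed(1)                            # arbitrary seed, for reproducibility
x <- rnorm(200, mean = 10, sd = 2)     # simulated stand-in for pm25 values

h_default <- hist(x, plot = FALSE)                # default choice of bins
h_fine <- hist(x, breaks = 100, plot = FALSE)     # ask for about 100 bins

length(h_default$counts)   # number of bins used by default
length(h_fine$counts)      # number of bins used when 100 were requested
sum(h_fine$counts)         # every observation lands in exactly one bin
```

A single number passed to `breaks` is treated as a suggestion: R rounds the break points to "pretty" values, so the number of bins actually used may differ from the one requested.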
boxplot(pollution$pm25, col = "wheat")
abline(h = 12) # overlay a horizontal line at the level of the national air quality standard
hist(pollution$pm25,
col = "wheat")
abline(v = 12, lwd = 2) # overlay a vertical line at the level of the national air quality standard
abline(v = median(pollution$pm25),
col = "magenta",
lwd = 2)
- graphical summary for categorical data
- `barplot(df$var)`
- for the air pollution example:
barplot(table(pollution$region),
col = "wheat",
main = 'Number of counties in each region')
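The barplot above is drawn from the output of `table()`, which counts the observations in each category. A minimal sketch with made-up region labels (hypothetical, for illustration only):

```r
# Hypothetical region labels
region <- factor(c("east", "east", "west", "east", "west"))

counts <- table(region)    # number of observations per level
counts
prop.table(counts)         # the same counts expressed as proportions
```

Passing `prop.table(counts)` to `barplot()` instead of `counts` would show shares rather than raw counts.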
For two dimensions:
For more than two dimensions:
boxplot(pm25 ~ region,
data = pollution,
col = "wheat")
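The grouped boxplot has a numeric counterpart: the per-group medians it draws can be computed with `tapply()`. A sketch assuming a small made-up data frame that mimics the `pm25` and `region` columns (the values are hypothetical).

```r
# Hypothetical data frame mimicking two columns of the pollution data
toy <- data.frame(
  pm25   = c(9.8, 10.0, 12.4, 11.3, 13.1, 8.5),
  region = factor(c("east", "east", "east", "west", "west", "west"))
)

# Median of pm25 within each region
med <- tapply(toy$pm25, toy$region, median)
med

# The same medians as stored by boxplot(); plot = FALSE skips the drawing
b <- boxplot(pm25 ~ region, data = toy, plot = FALSE)
b$stats[3, ]               # third row holds the median of each group
```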
par(mfrow = c(2,1),
mar = c(4,4,2,1))
hist(subset(pollution, region == "east")$pm25,
col = 'wheat')
hist(subset(pollution, region == "west")$pm25,
col = 'wheat')
with(pollution, plot(latitude, pm25))
abline(h = 12, lwd = 2, lty = 2)
with(pollution, plot(latitude, pm25, col = region))
abline(h = 12, lwd = 2, lty = 2)
par(mfrow = c(1,2),
mar = c(5,4,2,1))
with(subset(pollution, region == 'west'),
plot(latitude, pm25, main = "West"))
abline(h = 12, lwd = 2, lty = 2)
with(subset(pollution, region == 'east'),
plot(latitude, pm25, main = "East"))
abline(h = 12, lwd = 2, lty = 2)
Three core plotting systems:
- Base: a plot is built up piece by piece, using annotation functions (text, lines, points, axis) issued as a series of R commands.

library(datasets)
data("cars")
with(cars, plot(speed, dist))
The base graphics system in R is encapsulated in the following packages:
- graphics - contains plotting functions for the “base” graphing system, including plot, hist, boxplot and many others
- grDevices - contains all the code implementing the various graphics devices, including X11, PDF, PostScript, PNG, etc.
- Lattice: implemented in the lattice package; a plot is created with a single function call (xyplot, bwplot, etc.).

library(lattice)
state <- data.frame(state.x77, region = state.region)
xyplot(Life.Exp ~ Income | region,
data = state,
layout = c(4, 1))
- lattice - contains code for producing Trellis graphics, which are independent of the “base” graphics system; includes functions like xyplot, bwplot, levelplot
- grid - implements a different graphing system independent of the “base”; the lattice package builds on top of grid
library(ggplot2)
data(mpg)
qplot(displ, hwy, data = mpg)
- Calling plot(x, y) or hist(x) will launch a graphics device and draw a new plot on the device.
- plot has many arguments, letting you set the title, x axis label, …
- The base graphics parameters are documented in ?par.

library(datasets)
hist(airquality$Ozone) ## Draw a new plot
library(datasets)
with(airquality, plot(Wind, Ozone))
library(datasets)
airquality <- transform(airquality, Month = factor(Month))
boxplot(Ozone ~ Month,
airquality,
xlab = "Month",
ylab = "Ozone (ppb)")
Default values for the global graphics parameters:
par('lty')
## [1] "solid"
par('col')
## [1] "black"
par('pch')
## [1] 1
par('bg')
## [1] "white"
par('mar')
## [1] 5.1 4.1 4.1 2.1
par('mfrow')
## [1] 1 1
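Because these parameters are global, a common pattern is to save the old values when changing them and restore them afterwards. A minimal sketch, assuming an off-screen device (`pdf(NULL)`) so that it also runs in a non-interactive session; drop that line and the final `dev.off()` to draw on screen instead.

```r
library(datasets)

pdf(NULL)                                  # off-screen device, no file is written

# par() returns the previous values of the parameters it changes
old <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
inside <- par("mfrow")                     # layout while the change is active

hist(airquality$Ozone, main = "Ozone")
hist(airquality$Wind, main = "Wind")

par(old)                                   # put the saved settings back
restored <- par("mfrow")                   # layout after restoring

invisible(dev.off())                       # close the off-screen device
```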
- plot - makes a scatterplot or other type of plot depending on the class of the object being plotted
- lines - adds lines to a plot, given a vector of x values and a corresponding vector of y values (or a 2-column matrix); this function just connects the dots
- points - adds points to a plot
- text - adds text labels to a plot using specified x, y coordinates
- title - adds annotations: x and y axis labels, title, subtitle, outer margin
- mtext - adds arbitrary text to the margins (inner or outer) of a plot
- axis - adds axis ticks/labels

Adding title
library(datasets)
with(airquality, plot(Wind, Ozone))
title(main = 'Ozone and Wind in NY city') ## Add a title
Adding title directly in the plot() function
with(airquality, plot(Wind, Ozone,
main = 'Ozone and Wind in NY City'))
with(subset(airquality, Month == 5), points(Wind, Ozone, col = 'blue'))
Adding legend to the plot
with(airquality, plot(Wind, Ozone,
main = 'Ozone and Wind in NY City'))
with(subset(airquality, Month == 5), points(Wind, Ozone, col = 'blue'))
with(subset(airquality, Month != 5), points(Wind, Ozone, col = "red"))
legend("topright", pch = 1,
col = c('blue', "red"),
legend = c("May", "Other Months"))
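The other annotation functions (lines, text, mtext, axis) follow the same add-to-an-existing-plot pattern as title, points and legend. A rough sketch that combines them; the output file is a temporary PDF (an assumption made so the example runs without a screen), and the label text and positions are arbitrary choices.

```r
library(datasets)

out_file <- tempfile(fileext = ".pdf")     # hypothetical output location
pdf(out_file)

mean_ozone <- mean(airquality$Ozone, na.rm = TRUE)

# Start with an empty canvas: no points, no axes
with(airquality, plot(Wind, Ozone, type = "n", axes = FALSE))
with(airquality, points(Wind, Ozone, pch = 20))        # add the data
axis(1)                                                # x axis ticks and labels
axis(2)                                                # y axis ticks and labels
box()                                                  # frame around the plot

# Dashed line at the mean Ozone level, with a label above its right end
lines(range(airquality$Wind), rep(mean_ozone, 2), lty = 2)
text(max(airquality$Wind), mean_ozone, "mean Ozone", adj = c(1, -0.5))

mtext("Source: datasets::airquality", side = 1, line = 4, cex = 0.8)
title(main = "Ozone and Wind in NY City")

invisible(dev.off())
```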
Adding a regression line to the plot
with(airquality, plot(Wind, Ozone,
main = 'Ozone and Wind in NY City', pch = 20))
model <- lm(Ozone ~ Wind, airquality)
abline(model, lwd = 2)
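`abline()` reads the intercept and slope from the fitted model. A short sketch that makes this explicit, again using an off-screen device (an assumption, so the code runs anywhere); the two `abline()` calls draw the same line.

```r
library(datasets)

model <- lm(Ozone ~ Wind, data = airquality)   # rows with missing values are dropped
cf <- coef(model)                              # named vector: (Intercept), Wind
n_used <- nobs(model)                          # number of rows used in the fit

pdf(NULL)                                      # off-screen device
with(airquality, plot(Wind, Ozone, pch = 20))
abline(model, lwd = 2)                                        # line from the model object
abline(a = cf[["(Intercept)"]], b = cf[["Wind"]], lty = 2)    # same line, given explicitly
invisible(dev.off())
```

The response goes on the left of the formula (`Ozone ~ Wind`) so that it matches the variable on the y axis of the plot.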
Multiple base plots
par(mfrow = c(1,2))
with(airquality,{
plot(Wind, Ozone, main = 'Ozone and Wind')
plot(Solar.R, Ozone, main = 'Ozone and Solar Radiation')
})
par(mfrow = c(1,3), mar = c(4,4,2,1), oma = c(0,0,2,0))
with(airquality,{
plot(Wind, Ozone, main = 'Ozone and Wind')
plot(Solar.R, Ozone, main = 'Ozone and Solar Radiation')
plot(Temp, Ozone, main = 'Ozone and Temperature')
mtext("Ozone and Weather in NY", outer = TRUE)
})
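When the panels differ only in the x variable, the repeated plot() calls can be replaced by a loop. This is one possible variation, not the approach used above; the `predictors` vector is a name introduced here for illustration, and the off-screen device is an assumption so the sketch runs without a screen.

```r
library(datasets)

predictors <- c("Wind", "Solar.R", "Temp")     # columns to plot against Ozone

pdf(NULL)                                      # off-screen device
par(mfrow = c(1, length(predictors)),          # one panel per predictor
    mar = c(4, 4, 2, 1),
    oma = c(0, 0, 2, 0))                       # room for an overall title

for (p in predictors) {
  plot(airquality[[p]], airquality$Ozone,
       xlab = p, ylab = "Ozone",
       main = paste("Ozone and", p))
}
mtext("Ozone and Weather in NY", outer = TRUE)

invisible(dev.off())
```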
x <- rnorm(100)
hist(x)
Even though no parameters were specified, the plot has:
- a title (Histogram of x)
- an x-axis label (the variable x)
- a y-axis label (Frequency)
x <- rnorm(100)
y <- rnorm(100)
plot(x ,y)
Even though no parameters were specified, the plot has:
- an x-axis label (the variable x)
- a y-axis label (the variable y)
The regions of the plot:
- The margins are numbered by side: side 1 (bottom), side 2 (left), side 3 (top), side 4 (right)
- They can be adjusted via par(mar = c(m1, m2, m3, m4))
par(mfrow = c(1, 3))
plot(x, y, pch = 2)
plot(x, y, pch = 3)
plot(x, y, pch = 4)
plot(x, y, pch = 20)
title("Scatterplot")
text(-2.5, 2, "Added text")
legend("topright", legend = 'Data', pch = 20)
fit <- lm(y ~ x) # regress y on x to match plot(x, y)
abline(fit, lwd = 2, col = 'blue')
plot(x, y, xlab = "weight", ylab = "height", pch = 20)
title("Scatterplot")
text(-2.5, 2, "Added text")
legend("topright", legend = 'Data', pch = 20)
fit <- lm(y ~ x) # regress y on x to match plot(x, y)
abline(fit, lwd = 2, col = 'blue')
x <- rnorm(100)
y <- rnorm(100)
z <- rpois(100, 2)
par(mfrow = c(2,1), mar = c(2,2,1,1))
plot(x, y, pch = 20)
plot(x, z, pch = 20)
x <- rnorm(100)
y <- rnorm(100)
z <- rpois(100, 2)
par(mfrow = c(2,2), mar = c(2,2,1,1))
plot(x, y, pch = 20)
plot(z, x, pch = 20)
plot(x, z, pch = 20)
plot(y, z, pch = 20)
x <- rnorm(100,5,2)
noise <- rnorm(100)
y <- x + noise
## Generate factor levels
g <- gl(2,50, labels = c("Male", "Female"))
plot(x, y, type = 'n')
points(x[g == 'Male'], y[g == 'Male'],
pch = 19, col = 'darkmagenta')
points(x[g == 'Female'], y[g == 'Female'],
pch = 19, col = 'darkorange')
legend("topright", pch = 19,
legend = c("Male", 'Female'),
col = c('darkmagenta', 'darkorange'))
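The same picture can be produced in a single plot() call by building one colour per observation from the factor. A sketch of that alternative, assuming simulated data like the block above (the seed and the name `palette_by_group` are choices made here) and an off-screen device.

```r
set.seed(2)                                    # arbitrary seed
x <- rnorm(100, 5, 2)
y <- x + rnorm(100)
g <- gl(2, 50, labels = c("Male", "Female"))   # 50 "Male" then 50 "Female"

# Named lookup vector: one colour per factor level
palette_by_group <- c(Male = "darkmagenta", Female = "darkorange")
point_cols <- palette_by_group[as.character(g)]   # one colour per observation

pdf(NULL)                                      # off-screen device
plot(x, y, pch = 19, col = point_cols)
legend("topright", pch = 19,
       legend = names(palette_by_group),
       col = palette_by_group)
invisible(dev.off())
```

Indexing by `as.character(g)` looks colours up by level name, so the mapping does not depend on the order of the factor levels.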
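- On a Mac, the screen device is launched with quartz().
- On Windows, the screen device is launched with windows().
- On Unix/Linux, the screen device is launched with x11().
- The available graphics devices are listed in ?Devices.
- Functions like plot in base, xyplot in lattice or qplot in ggplot2 will default to sending a plot to the screen device.

There are two basic approaches to plotting.
- The first: call a plotting function (such as plot or qplot); the plot appears on the screen device.

library(datasets)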
with(faithful, plot(eruptions, waiting,
col = 'darkorange')) ## Make plot appear on the screen device
title(main = 'Old Faithful Geyser data')
- The second: explicitly launch a file device, draw the plot, and close the device with dev.off() (this is very important).

pdf(file = 'myplot.pdf') ## Open PDF device, create "myplot.pdf" in my working directory
with(faithful, plot(eruptions, waiting,
col = 'darkorange')) # Create plot and send to a file (no plot appears on screen)
title(main = 'Old Faithful Geyser data') # Annotate
dev.off() #Close the PDF file device
## quartz_off_screen
## 2
## Now you can view the file 'myplot.pdf' on your computer
Two basic types of file devices: vector and bitmap devices
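A minimal sketch of one device of each type: pdf() for a vector format and png() for a bitmap format. The helper `draw()` and the temporary file names are hypothetical, introduced only so both devices receive the same plot. Since png() is not available in every R build, that part is wrapped in tryCatch().

```r
library(datasets)

# Hypothetical helper: draws the same plot on whichever device is active
draw <- function() {
  with(faithful, plot(eruptions, waiting, col = "darkorange"))
  title(main = "Old Faithful Geyser data")
}

# Vector device: size is given in inches, output scales without loss
pdf_file <- tempfile(fileext = ".pdf")
pdf(pdf_file, width = 6, height = 4)
draw()
invisible(dev.off())

# Bitmap device: size is given in pixels
png_file <- tempfile(fileext = ".png")
png_ok <- tryCatch({
  png(png_file, width = 600, height = 400)
  draw()
  invisible(dev.off())
  TRUE
}, error = function(e) FALSE)   # FALSE if png() cannot be opened here
```

Vector formats tend to suit line drawings and plots with few points; bitmap formats tend to suit plots with very many points, where a vector file would become large.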
- The currently active graphics device can be found with dev.cur().
- The active graphics device can be changed with dev.set(<integer>), where <integer> is the number associated with the graphics device you want to switch to.
- dev.copy - copy a plot from one device to another
- dev.copy2pdf - specifically copy a plot to a PDF file

Copying a plot is not an exact operation; the result may not be identical to the original.
library(datasets)
with(faithful, plot(eruptions, waiting, col = 'brown4')) ## Creating the plot
title(main = 'Old Faithful Geyser data') ## Annotating the plot
dev.copy(png, file = "geyser.png") ## Copying the plot to a PNG file
## quartz_off_screen
## 3
dev.off() ## Closing the PNG device
## quartz_off_screen
## 2
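The device numbers printed by dev.off() above are the same numbers that dev.cur() and dev.set() work with. A small sketch of switching between two open devices, assuming off-screen PDF devices (`pdf(NULL)`) so that nothing is written to disk; the exact numbers depend on what is already open in the session, so they are stored rather than hard-coded.

```r
pdf(NULL)
d1 <- dev.cur()          # the first device becomes the active one

pdf(NULL)
d2 <- dev.cur()          # opening a second device makes it active

dev.set(d1)              # switch back to the first device
active <- dev.cur()      # now refers to the first device again

invisible(dev.off(d2))   # close both devices by number
invisible(dev.off(d1))
```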